Logistic regression determines the values of the regression coefficients that are most consistent with the observed data
using what’s called the maximum likelihood criterion. The likelihood of any statistical model is the probability (based on the
model) of obtaining the values you observed in your data. There’s a likelihood value for each row in the data set, and a total
likelihood (L) for the entire data set. The likelihood value for each data point is the predicted probability of getting the observed
outcome result. For individuals who died (refer to Table 18-1), the likelihood is the probability of dying (Y) predicted by the
logistic formula. For individuals who survived, the likelihood is the predicted probability of not dying, which is 1 – Y. The total likelihood (L) for the whole set of individuals is the product of all the calculated likelihoods for each individual.
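To make this concrete, here's a small sketch of that calculation in Python; the predicted probabilities and outcomes below are made up purely to show the arithmetic:

```python
# Hypothetical predicted probabilities of dying (Y, from the logistic
# formula) and observed outcomes (1 = died, 0 = survived) for five people.
predicted = [0.90, 0.75, 0.40, 0.20, 0.05]
observed  = [1,    1,    0,    0,    0]

# Likelihood of each row: the predicted probability of the outcome that
# was actually observed (Y if the person died, 1 - Y if they survived).
likelihoods = [y if died else 1 - y for y, died in zip(predicted, observed)]

# The total likelihood L is the product of the individual likelihoods.
L = 1.0
for lik in likelihoods:
    L *= lik
print(L)  # about 0.308
```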
To find the values of the coefficients that maximize L, it is most practical to find the values that minimize the quantity –2 multiplied by the natural logarithm of L, which is also called the –2 log likelihood and abbreviated –2LL. Statisticians also call –2LL the deviance. The closer the curve designated by the regression formula comes to the observed points, the smaller this deviance value will be. The actual value of the deviance for a logistic regression model doesn't mean much by itself. It's the difference in deviance between two models you might be comparing that is important.
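Carrying the made-up example above one step further, here's the deviance calculation. Real programs sum the logs of the individual likelihoods rather than multiplying the likelihoods themselves, because a product of thousands of small probabilities can underflow to zero:

```python
import math

# The individual likelihoods from the sketch above.
likelihoods = [0.90, 0.75, 0.60, 0.80, 0.95]

L = math.prod(likelihoods)       # total likelihood, about 0.308
deviance = -2 * math.log(L)      # the -2 log likelihood (-2LL)

# Equivalent, numerically safer form: sum the logs of the individual
# likelihoods, then multiply by -2.
deviance_stable = -2 * sum(math.log(lik) for lik in likelihoods)
print(deviance, deviance_stable)  # both about 2.357
```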
Once the deviance is defined, the final step is to identify the values of the coefficients that minimize the deviance of the observed Y values from the fitted logistic curve. This may sound challenging, but statistical programs employ elegant and efficient ways to minimize such a complicated function of several variables, and they use these methods to obtain the coefficients.
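To give you a feel for what's happening under the hood, here's a minimal sketch that hands the deviance of a one-predictor logistic model to SciPy's general-purpose Nelder-Mead minimizer. The dose and outcome values are invented, and real statistical packages typically use more specialized algorithms (such as iteratively reweighted least squares), but the idea is the same:

```python
import numpy as np
from scipy.optimize import minimize

# Invented example data: radiation dose and outcome (1 = died, 0 = survived).
dose = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
died = np.array([0, 0, 1, 0, 1, 1])

def deviance(coefs):
    """The -2 log likelihood of a one-predictor logistic model."""
    a, b = coefs
    y = 1 / (1 + np.exp(-(a + b * dose)))  # predicted probability of dying
    lik = np.where(died == 1, y, 1 - y)    # likelihood of each observed row
    return -2 * np.sum(np.log(lik))

# Start at a = 0, b = 0 and let the optimizer search for the
# coefficient values that minimize the deviance.
result = minimize(deviance, x0=[0.0, 0.0], method="Nelder-Mead")
print(result.x)  # the best-fitting values of a and b
```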
Handling multiple predictors in your logistic model
The data in Table 18-1 have only one predictor variable, but you may have several predictors of a
binary outcome. If the data in Table 18-1 were about humans, you might expect the chance of dying from radiation exposure to depend not only on the radiation dose received, but also on age, gender, weight, general health, radiation wavelength, and the amount of time over which the person was
exposed to radiation. In Chapter 17, we describe how the straight-line regression model can be
generalized to handle multiple predictors. You can generalize the logistic formula to handle multiple
predictors in the same way.
Suppose that the outcome variable Y is dependent on three predictors called X, V, and W. Then the
multivariate logistic model looks like this:

Y = 1 / (1 + e^(–(a + bX + cV + dW)))
Logistic regression finds the best-fitting values of the parameters a, b, c, and d given your data. That
way, for any particular set of values for X, V, and W, you can use the equation to predict Y, which is the
probability of being positive for the outcome.
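For instance, with made-up coefficient and predictor values, the calculation looks like this (the numbers are purely illustrative):

```python
import math

# Hypothetical fitted coefficients: a is the intercept; b, c, and d
# go with the predictors X, V, and W.
a, b, c, d = -4.0, 0.8, 0.05, -0.3

# Hypothetical predictor values for one individual.
X, V, W = 3.0, 40.0, 2.0

# Predicted probability of being positive for the outcome.
Y = 1 / (1 + math.exp(-(a + b * X + c * V + d * W)))
print(Y)  # about 0.45
```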
Running a Logistic Regression Model with Software
The theory behind logistic regression is difficult to grasp, and the calculations are complicated
(see the sidebar “Getting into the nitty-gritty of logistic regression” for details). The good news is
that most statistical software (as described in Chapter 4) can run a logistic regression model, and that doing so is similar to running a straight-line or multiple linear regression model (see Chapters 16 and
17). Here are the steps:
1. Make sure your data set has a column for the outcome variable that is coded as 1 where the